Add FileService as a standalone microservice, LakeFS+S3 as dataset storage #3296

Open · bobbai00 wants to merge 47 commits into master from jiadong-add-file-service
Conversation

@bobbai00 (Collaborator) commented Mar 2, 2025

This PR introduces the FileService as a new microservice alongside WorkflowCompilingService, ComputingUnitMaster/Worker, and TexeraWebApplication.

Purpose of the FileService

  • We want to improve the performance of our current Git-based dataset implementation.
  • We decided to go with LakeFS + S3: LakeFS handles the version-control metadata, and S3 handles the data transfer. However, LakeFS does not provide an access-control layer.
  • Therefore, we built the FileService, which provides:
    • all the APIs related to versioned files in datasets
    • access control

Architecture before and after adding FileService

Before: (architecture diagram screenshot)

After: (architecture diagram screenshot)

Key Changes

  • A new service, FileService, is introduced. All dataset-related endpoints are now hosted on FileService.
  • Several configuration items related to LakeFS and S3 are introduced in storage-config.yaml.
  • The frontend UI is updated to accommodate these changes.
  • ComputingUnitMaster and ComputingUnitWorker call FileService to read files, and their access is verified during those calls. In the dynamic computing architecture (which will be introduced in Add computing unit manager service #3298), they send requests along with the current user's token; in the single-machine architecture, they bypass the network entirely and make direct local function calls.
  • Python UDFs can now read a dataset's file directly, for example:

  file = DatasetFileDocument("The URL of the file")
  data = file.read_file()  # returns an io.BytesIO object

You may refer to core/amber/src/main/python/pytexera/storage/dataset_file_document.py for implementation details. This feature is only available in the dynamic computing architecture.
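Since file reads in the dynamic computing architecture carry the current user's token for access verification, the request shape can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: the URL, port, and header name are assumptions.

```python
import urllib.request


def build_dataset_file_request(file_url: str, user_token: str) -> urllib.request.Request:
    """Build an HTTP request for a dataset file, attaching the current
    user's token so FileService can verify dataset access before
    serving the bytes. (Hypothetical sketch; the real endpoint and
    header used by FileService may differ.)"""
    req = urllib.request.Request(file_url)
    req.add_header("Authorization", f"Bearer {user_token}")
    return req


# Example usage (hypothetical URL):
req = build_dataset_file_request(
    "http://localhost:9092/api/dataset/file?path=my-dataset/data.csv",
    "current-user-token",
)
```

In the single-machine architecture no such request is built; the same read is a direct local function call, so the token round-trip is skipped.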

How to migrate previous datasets to the new datasets managed by LakeFS

As we did quite a bit of refactoring, the two dataset implementations are not compatible with each other. To migrate previous datasets to the latest implementation, you will need to re-upload the data via the new UI.

How to deploy new architecture

Step 1. Deploy LakeFS & MinIO

Use Docker (highly recommended for local development)

  • Go to the directory core/file-service/src/main/resources
  • Execute docker-compose --profile local-lakefs up -d in that directory

Use Binary (Recommended for production deployment)

Refer to https://docs.lakefs.io/howto/deploy/

Step 2. Configure storage-config.yaml

Configure the section below in storage-config.yaml:

  lakefs:
    endpoint: ""
    auth:
      api-secret: ""
      username: ""
      password: ""
    block-storage:
      type: ""
      bucket-name: ""

  s3:
    endpoint: ""
    auth:
      username: ""
      password: ""

If you used core/file-service/src/main/resources/docker-compose.yml to install LakeFS & MinIO, you can use the following configuration directly:

  lakefs:
    endpoint: "http://127.0.0.1:8000/api/v1"
    auth:
      api-secret: "random_string_for_lakefs"
      username: "AKIAIOSFOLKFSSAMPLES"
      password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    block-storage:
      type: "s3"
      bucket-name: "texera-dataset"

  s3:
    endpoint: "http://localhost:9000"
    auth:
      username: "texera_minio"
      password: "password"
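A quick way to catch a misconfigured storage-config.yaml before launching the services is to check that every field in the section above is filled in. This is a hypothetical helper, not part of the PR; the field names simply mirror the YAML keys shown above.

```python
def validate_storage_config(cfg: dict) -> list:
    """Return the dotted paths of required storage-config fields that
    are missing or empty. (Hypothetical sketch; field names mirror the
    storage-config.yaml section above.)"""
    required = [
        ("lakefs", "endpoint"),
        ("lakefs", "auth", "api-secret"),
        ("lakefs", "auth", "username"),
        ("lakefs", "auth", "password"),
        ("lakefs", "block-storage", "type"),
        ("lakefs", "block-storage", "bucket-name"),
        ("s3", "endpoint"),
        ("s3", "auth", "username"),
        ("s3", "auth", "password"),
    ]
    missing = []
    for path in required:
        node = cfg
        for key in path:
            # Flag the field if an intermediate dict is absent or the value is empty.
            if not isinstance(node, dict) or key not in node or node[key] in ("", None):
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

Running this against the parsed YAML (e.g. loaded with PyYAML) returns an empty list for the sample configuration above.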

Step 3. Launch services

Launch FileService in addition to TexeraWebApplication, WorkflowCompilingService, and ComputingUnitMaster.

Future PRs after this one

  • Remove the dataset-related endpoints completely from the amber package.
  • Incorporate the deployment of LakeFS + S3 into the Helm chart for the K8s-based deployment.
  • Some optimizations:
    • for small files, upload them directly instead of using multipart upload
    • when exporting results, use multipart upload, as result sizes can be quite large
    • support resuming transmission of partially uploaded files
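The small-file vs. multipart optimization above amounts to a size-based upload policy. The sketch below illustrates the idea only: the function name is hypothetical, and the 5 MiB threshold is an assumption (it matches S3's minimum multipart part size, but the actual cutoff chosen in a future PR may differ).

```python
# S3 requires multipart parts (except the last) to be at least 5 MiB;
# we assume that as the direct-upload cutoff for this illustration.
MULTIPART_THRESHOLD = 5 * 1024 * 1024  # bytes


def plan_upload(data: bytes, part_size: int = MULTIPART_THRESHOLD):
    """Return ("direct", [data]) for small payloads, or
    ("multipart", parts) where each part is at most part_size bytes.
    (Hypothetical helper sketching the planned optimization.)"""
    if len(data) <= part_size:
        # Small file: a single request avoids multipart bookkeeping overhead.
        return "direct", [data]
    # Large payload (e.g. a result export): split into fixed-size chunks.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    return "multipart", parts
```

Resuming a partially uploaded file would then reduce to re-sending only the parts whose uploads were not acknowledged.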

@bobbai00 bobbai00 changed the title Add FileService as a standalone microservice, and LakeFS+S3 as dataset storage Add FileService as a standalone microservice, LakeFS+S3 as dataset storage Mar 2, 2025
@bobbai00 bobbai00 marked this pull request as ready for review March 2, 2025 22:16
@bobbai00 bobbai00 self-assigned this Mar 3, 2025
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 5db607e to 0e7a9d8 Compare March 3, 2025 18:39
@aglinxinyuan (Collaborator) left a comment:

Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 6d101a1 to da96a2a Compare March 4, 2025 21:26
@bobbai00 (Collaborator, Author) commented Mar 4, 2025

> Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

OK. I have added a flag in environment.default.ts

@aglinxinyuan (Collaborator) left a comment:

LGTM!
Tested on both Windows and Mac; the setup is very smooth.
Please add more details to Step 3, for example, how developers can migrate to this from the current master.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch 2 times, most recently from 07d4789 to 2e38fad Compare March 9, 2025 21:53
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 2e38fad to 6accd5b Compare March 10, 2025 05:20